Abstract

Objects in visual scenes come in a rich variety of transformed
states. A few classes of transformation have been heavily studied
in computer vision: mostly simple, parametric changes in color and
geometry. However, transformations in the physical world occur in
many more flavors, and they come with semantic meaning: e.g.,
bending, folding, aging, etc. The transformations an object can
undergo tell us about its physical and functional properties. In
this paper, we introduce a dataset of objects, scenes, and
materials, each of which is found in a variety of transformed
states. Given a novel collection of images, we show how to explain
the collection in terms of the states and transformations it
depicts. Our system works by generalizing across object classes:
states and transformations learned on one set of objects are used
to interpret the image collection for an entirely new object class.


[Figure 1 panels: “Input: tomato” (top-left), “Discovered states” (right)]

Figure 1. Example input and automatic output of our system:
Given a collection of images from one category (top-left, subset of
collection shown), we are able to parse the collection into a set of
states (right). In addition, we discover how the images transform
between antonymic pairs of states (bottom-left). Here we visualize
the transformations using the technique described in Section 4.5.


1.  Introduction 

Much work in computer vision has focused on the problem of
invariant object recognition [9, 7], scene recognition [32, 33],
and material recognition [26]. The goal in each of these cases is
to build a system that is invariant to all within-class variation.
Nonetheless, the variation in a class is quite meaningful to a
human observer. Consider Figure 1. The collection of photos on the
left only shows tomatoes. An object recognition system should just
see “tomato”. However, we can see much more: we can see peeled,
sliced, and cooked tomatoes. We can notice that some of the
tomatoes are riper than others, and some are fresh while others are
moldy.

Given a collection of images of an object, what can a computer
infer? Given 1000 images of tomatoes, can we learn how tomatoes
work? In this paper we take a step toward that goal. From a
collection of photos, we infer the states and transformations
depicted in that collection. For example, given a collection of
photos like that on the left of Figure 1, we infer that tomatoes
can undergo the following transformations, among others: ripening,
wilting, molding, cooking, slicing, and caramelizing. Our system
does this without having ever seen a photo of a “tomato” during
training (although overlapping classes, such as “fruit”, may be
included in the training set). Instead we transfer knowledge from
other related object classes.

The  problem  of  detecting  image  state  has  received  some 
prior  attention.  For  example,  researchers  have  worked  on 
recognizing  image  “attributes”  (e.g.,  [10],  [24],  [23],  [11]), 
which  sometimes  include  object  and  scene  states.  However, 
most  of  this  work  has  dealt  with  one  image  at  a  time  and  has 
not  extensively  catalogued  the  state  variations  that  occur  in 
an  entire  image  class.  Unlike  this  previous  work,  we  focus 
on  understanding  variation  in  image  collections. 

In addition, we go beyond previous attributes work by linking up
states into pairs that define a transformation: e.g., raw↔cooked,
rough↔smooth, deflated↔inflated. We explain image collections both
in terms of their states (unary states) and transformations
(antonymic state pairs). In addition, we show how state pairs can
be used to extract a continuum of images depicting the full range
of the transformation (Figure 1 bottom-left).

Understanding image collections is a relatively unexplored task,
although there is growing interest in this area. Several methods
attempt to represent the continuous variation in an image class
using subspaces [  ], [22] or manifolds [13]. Unlike this work, we
investigate discrete, nameable transformations, like crinkling,
rather than working in a hard-to-interpret parameter space. Photo collections
have  also  been  mined  for  storylines  [15]  as  well  as  spatial 
and  temporal  trends  [18],  and  systems  have  been  proposed 
for  more  general  knowledge  discovery  from  big  visual  data 
Y1  ],  [1],  [3].  Our  paper  differs  from  all  this  work  in  that 
we  focus  on  physical  state  transformations ,  and  in  addition 
to  discovering  states  we  also  study  state  pairs  that  define  a 
transformation. 

To demonstrate our understanding of states and transformations,
we test on three tasks. As input we take a set of images depicting
a noun class our system has never seen before (e.g., tomato;
Figure 1). We then parse the collection:

• Task 1 - Discovering relevant transformations: What are the
transformations that the new noun can undergo (e.g., a tomato can
undergo slicing, cooking, ripening, etc.)?

• Task 2 - Parsing states: We assign a state to each image in the
collection (e.g., sliced, raw, ripe).

• Task 3 - Finding smooth transitions: We recover a smooth chain
of images linking each pair of antonymic states.

Similarly to previous work on transfer learning [6, 4, 19, 2: ],
our underlying assumption is the transferability of adjectives
(states and transformations) across nouns (see Figure 2). To solve
these problems, we train classifiers for each state using
convolutional neural net (CNN) features [8]. By applying these
classifiers to each image in a novel image set, we can discover the
states and transformations in the collection. We globally parse the
collection by integrating the per-image inferences with a
conditional random field (CRF).

[Figure 2 panels: Melted chocolate, Melted sugar, Melted butter]

Figure 2. Transferability of adjectives: Each adjective can apply
to multiple nouns. Melted describes a particular kind of state: a
blobby, goopy state. We can classify images of chocolate as melted
because we train on classes like sugar and butter that have similar
appearance when melted.

Note that these tasks involve a hard generalization problem: we
must transfer knowledge about how certain nouns work, like apples
and pears, to an entirely novel noun, such as banana. Is it
plausible that we can make progress on this problem? Consider
Figure 2. Melted chocolate, melted sugar, and melted butter all
look sort of the same. Although the material is different in each
case, “meltedness” always produces a similar visual style: smooth,
goopy, drips. By training our system to recognize this visual style
on chocolate and sugar, we are able to detect the same kind of
appearance in butter. This approach is reminiscent of Freeman and
Tenenbaum’s work on separating style and content [29]. However,
whereas they focused on classes with just a single visual style,
our image collections contain many possible states.

Our contribution in this paper is threefold: (1) introducing the
novel problem of parsing an image collection into the set of
physical states and transformations it contains, (2) showing that
states and transformations can be learned with basic yet powerful
techniques, and (3) building a dataset of objects, scenes, and
materials in a variety of transformed states.

2.  States  and  transformations  dataset 

The computer vision community has put a lot of effort into
creating datasets. As a result, there are many great datasets that
cover object [9, 7, 31, 25, 20], attribute [16, 1, 10], material
[26], and scene categories [32, 33]. Here, our goal is to create an
extensive dataset for characterizing the state variation that
occurs within image classes. How can we organize all this
variation? It turns out language has come up with a solution:
adjectives. Adjectives modify nouns by specifying the state in
which they occur. Each noun can be in a variety of adjective
states, e.g., rope can be short, long, coiled, etc. Surprisingly
little previous work in computer vision has focused on adjectives
[11, 17].

Language also has a mechanism for describing transformations:
verbs. Often, a given verb will be related to one or more
adjectives: e.g., to cook is related to cooked and raw. In order to
effectively query images that span a full transformation, we
organize our state adjectives into antonym pairs. Our working
definition of a transformation is thus a pair {adjective, antonym}.

We collected our dataset by defining a wide variety of
{adjective, antonym, noun} triplets. Certain adjectives, such as
mossy, have no clear antonym. In these cases, we instead define the
transformation as simply an {adjective, noun} pair. For each
transformation, we perform an image search for the string
“adjective noun” and also “antonym noun” if the antonym exists. For
example, search queries included cooked fish, raw fish, and mossy
branch.


[Figure 3 panels: Fish, Room, Persimmon]

Figure 3. Example categories in our dataset: fish, room, and persimmon. Images are visualized using t-SNE [30] in CNN feature space.
For visualization purposes, gray boxes containing ground truth relevant adjectives are placed at the median location of the images they
apply to. Dotted red lines connect antonymic state pairs. Notice that this feature space organizes the states meaningfully.


2.1.  Adjective  and  noun  selection 

We generated 2550 “adjective noun” queries as follows. First we
selected a diverse set of 115 adjectives, denoted $\mathcal{A}$
throughout the paper, and 249 nouns, denoted $\mathcal{N}$. For
nouns, we selected words that refer to physical objects, materials,
and scenes. For adjectives, we selected words that refer to
specific physical transformations. Then, for each adjective, we
paired it with another antonymic adjective in our list if a clear
antonym existed.

Crossing all 115 adjectives with the 249 nouns would be
prohibitively expensive, and most combinations would be
meaningless. Each noun can only be modified by certain adjectives.
The set of relevant adjectives that can modify a noun tells us
about the noun’s properties and affordances. We built our dataset
to capture this type of information: each noun is paired only with
a subset of relevant adjectives.

N-gram probabilities allow us to decide which adjectives are
relevant for each noun. We used Microsoft’s Web N-gram Services¹ to
measure the probability of each {adj noun} phrase that could be
created from our lists of adjectives and nouns. For each noun,
$N \in \mathcal{N}$, we selected adjectives, $A \in \mathcal{A}$,
based on pointwise mutual information, PMI:

$$\mathrm{PMI}(A, N) = \log \frac{P(A, N)}{P(A)\,P(N)}, \qquad (1)$$

where we define $P(A, N)$ to be the probability of the phrase
“A N”. PMI is a measure of the degree of statistical association
between A and N.

For each noun N, we selected the top 20 adjectives A with highest
min(PMI(A, N), PMI(ant(A), N)), where ant(A) is the antonym of A if
it exists (otherwise the score is just PMI(A, N)). We further
removed all adjectives from this list whose PMI(A, N) was less than
the mean value for that list. This gave us an average of 9
adjectives per noun.

¹ http://research.microsoft.com/en-us/collaboration/focus/cs/web-ngram.aspx
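The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary-based probability tables, and the toy inputs are hypothetical, and the mean used for the final filter is taken over PMI(A, N) of the top-ranked list.

```python
import math

def pmi(p_joint, p_adj, p_noun):
    """Pointwise mutual information of an "adjective noun" phrase (Equation 1)."""
    return math.log(p_joint / (p_adj * p_noun))

def select_adjectives(noun, adjectives, antonym, p_phrase, p_word, top_k=20):
    """Return the adjectives kept for `noun`.

    antonym:  dict adjective -> antonym (a missing key means no clear antonym)
    p_phrase: dict (adjective, noun) -> phrase probability (hypothetical table)
    p_word:   dict word -> unigram probability (hypothetical table)
    """
    def pmi_of(adj):
        return pmi(p_phrase[(adj, noun)], p_word[adj], p_word[noun])

    def score(adj):
        # Rank by the weaker of the two associations when an antonym exists.
        ant = antonym.get(adj)
        return pmi_of(adj) if ant is None else min(pmi_of(adj), pmi_of(ant))

    ranked = sorted(adjectives, key=score, reverse=True)[:top_k]
    # Drop adjectives whose own PMI falls below the mean PMI of the list.
    mean_pmi = sum(pmi_of(a) for a in ranked) / len(ranked)
    return [a for a in ranked if pmi_of(a) >= mean_pmi]
```

Taking the minimum over an adjective and its antonym means a pair is kept only when both ends of the transformation are plausible for the noun.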

2.2.  Image  selection 

We scraped up to 50 images from Bing for each explicit
{adj, noun} query, in addition to querying by noun alone. Although
we scraped with an exact target query, the returned results are
often noisy. The main causes of noise are {adj, noun} pairs that
are a product name, a rare combination, or a concept that is hard
to visualize.

Hence, we cleaned up the data through an online crowdsourcing
service, having human labelers remove any images in a noun category
that did not depict that noun. Figure 3 shows our data for three
noun classes, with relevant adjective classes overlaid.

2.3.  Annotating  transformations  between  antonyms 

While scraped images come with a weak state label, we also
collected human-labeled annotations for a subset of our dataset
(218 {adj, antonym adj, noun} triplets). For these annotations, we
had labelers rank images according to how much they expressed an
adjective state. This data gives us a way to evaluate our
understanding of the full transformation from “fully in state A’s
antonym” to “fully in state A” (referred to as ranking ant(A) to A
henceforth). Annotators split each noun category into 4 sections as
follows. We give examples for A = open and N = door:

• “Fully A” - For example, fully open door images fall into this
category.

• “Between-A and ant(A)” - Half-way open door images fall into
this category.

• “Fully ant(A)” - Fully closed door images fall into this
category.

• “Irrelevant image” - For example, an image of a broken door
lying on the ground.

We asked annotators to rank images accordingly by drag-and-drop.

3.  Methods 

Our goal in this paper is to discover state transformations in an
image collection. Unlike the traditional recognition task, rather
than recognizing an object (noun) or one attribute, we are
interested in understanding an object’s (noun’s) states and the
transformations to and from those states. We study several
scenarios: single-image state classification, relevant
transformation retrieval from the image collection, and ordering by
transformation.

The common theme is to learn states and transformations that can
generalize over different nouns. We adopt this generalization
criterion because it is impossible to collect training examples
that cover the entire space of {noun} × {adjective}. Hence, in the
following problem formulations, we always assume that no image from
the specific target noun has been shown to the algorithm. For
example, no apple image is used during training if we want to order
images for the transformation to sliced apple. In other words, we
follow the concept of transfer learning.

3.1.  Image  state  classification 

First, a simple task is to identify the most relevant adjective
that describes a single image. Figure 4 shows examples of images in
our dataset. Can we tell that the dominant state of Figure 4b is
sliced? Also, can we tell how sliced the apple in the image is? As
mentioned above, we impose a hard constraint that we never see any
apple image (including sliced apple) during the training stage. Our
goal is to learn what it means to be sliced from all other nouns,
and to be able to transfer that knowledge to a new category (e.g.,
apple) and infer the state of the image.


[Figure 4 panels, each showing a state above its antonym state:
(a) tiny / huge bear, (b) sliced / whole apple, (c) inflated /
deflated ball, (d) cooked / raw fish, (e) open / closed door,
(f) clean / dirty water]

Figure 4. Examples of objects in a variety of transformed
states and their antonym states: Notice that each state of an
object has at least one antonym state.


Our approach to this problem is to train a logistic regression
model. Let $N \in \mathcal{N}$ be the query noun, which is excluded
from our training set. Then, using all non-N images, we split them
into positive and negative sets. To train a classifier for
adjective $A \in \mathcal{A}$, the positive set is all images of A,
and the negative set is all images not related to A. Then, the
score of A for image I, denoted $g(A \mid I)$, can be computed by:

$$g(A \mid I) = \sigma(-\mathbf{w}_A^{\top} f(I)), \qquad (2)$$

where $\sigma$ is the sigmoid function, $f(I)$ is a feature vector
of I, and $\mathbf{w}_A$ is a weight vector trained using a
logistic regression model.
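The held-out-noun training setup can be illustrated with a small pure-Python sketch. It is not the implementation used in our experiments: the helper names are hypothetical, images carrying any other adjective label are treated as negatives (an assumption), training is plain gradient descent without regularization, and the score uses the common convention sigmoid(w·f + b), whose sign convention for the weights may differ from Equation 2.

```python
import math

def leave_one_noun_out(samples, adjective, held_out_noun):
    """Split (features, adjective, noun) samples into positives and negatives.

    Every image of `held_out_noun` is dropped, so the classifier never sees
    the query noun during training.
    """
    positives, negatives = [], []
    for features, adj, noun in samples:
        if noun == held_out_noun:
            continue
        (positives if adj == adjective else negatives).append(features)
    return positives, negatives

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(positives, negatives, steps=500, lr=0.5):
    """Batch gradient descent on the logistic loss (no regularization)."""
    dim = len(positives[0])
    w, b = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b

def state_score(w, b, features):
    """Score in (0, 1); higher means the image looks more like the adjective."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)
```

In practice the feature vectors would be CNN activations rather than the short lists used for illustration, and an off-the-shelf logistic regression solver would replace the hand-written loop.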

It is worth noting that each image can be in a mix of different
states (e.g., an image of fish can be both sliced and raw).
However, for simplicity, we assume each image has one dominant
state that we want to classify.

3.2. Which states are depicted in the image collection?

Figure 5. Discovering transformations: our goal is to find the
set of relevant adjectives depicted in a collection of images
representing one specific noun. In this figure, we want to predict
the transformations that describe a collection of apple images.


Our second problem is to discover the relevant set of
transformations that are depicted in a collection of images.
Figure 5 describes our task. We are given a set of apple images
scraped from the web. Assuming our algorithm has never seen any
apple image, can we tell whether this image collection contains the
transformations between pairs of adjectives and their antonyms -
(sliced, whole), (chopped, whole), and (crisp, soft)?

We now formalize this task. We want to find the best adjective
set, $\{A_j\}_{j \in J}$, that can describe the collection of
images, $\{I_i\}_{i \in \mathcal{I}}$, representing a single noun.
We abbreviate this set as $A_J$. Our goal is then to predict the
most relevant set of adjectives and antonyms describing
transformations, $A_J$, for the given collection of images. In this
problem, we constrain all $J$ to have the same size. More formally,
we find $J$ by maximizing

$$J = \operatorname*{argmax}_{J',\, |J'| = k} \sum_{j \in J'} \sum_{i \in \mathcal{I}} \left[ e^{\lambda g(A_j \mid I_i)} + e^{\lambda g(\mathrm{ant}(A_j) \mid I_i)} \right]. \qquad (3)$$

Rather than taking the sum over the raw $g(\cdot)$ scores, we take
the exponential of this value, with $\lambda$ being a free
parameter that trades off between how much this function is like a
sum versus like a max. In our experiments, we set $\lambda$ to 10.
Thus, only large values of $g(A_j \mid I_i)$ contribute
significantly to making $A_j$ appear relevant for the collection.
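Because the objective in Equation 3 is a sum of independent per-transformation terms, the best set of size k is simply the k highest-scoring {adjective, antonym} pairs. The sketch below illustrates this; the function names and the dictionary layout of the scores are hypothetical, and it assumes every candidate has an antonym.

```python
import math

def transformation_scores(g, pairs, lam=10.0):
    """Collection-level score of each (adjective, antonym) pair, as in Equation 3.

    g:     dict adjective -> list of per-image classifier scores
    pairs: list of (adjective, antonym) tuples
    lam:   the free parameter of Equation 3 (10 in the experiments)
    """
    scores = {}
    for adj, ant in pairs:
        scores[(adj, ant)] = sum(
            math.exp(lam * a) + math.exp(lam * b)
            for a, b in zip(g[adj], g[ant])
        )
    return scores

def top_k_transformations(g, pairs, k, lam=10.0):
    """Return the k pairs with the largest collection-level score."""
    scores = transformation_scores(g, pairs, lam)
    return sorted(pairs, key=lambda p: scores[p], reverse=True)[:k]
```

With a large exponent weight, a handful of confidently classified images dominates the score of a pair, which is the max-like behavior described above.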

3.3.  Collection  parsing 

Rather than classifying each image individually, we can do better
by parsing the collection as a whole. This is analogous to the
image parsing problem, in which each pixel in an image is assigned
a label. Each image is to a collection as each pixel is to an
image. Therefore, we call this problem collection parsing. We
formulate it as a conditional random field (CRF) similar to what
has been proposed for solving the pixel parsing problem (e.g.,
[27]). For a collection of images, $\mathbf{I}$, to which we want
to assign per-image states $\mathbf{A}$, we optimize the following
conditional probability:

$$\log p(\mathbf{A} \mid \mathbf{I}) = \sum_i g(A_i \mid I_i) + \lambda \sum_{i,j \in \mathcal{N}} \psi(A_i, A_j \mid I_i, I_j) + \log Z, \qquad (4)$$

where $Z$ normalizes, $g$ serves as our data term, and the pairwise
potential $\psi$ is a similarity-weighted Potts model:

$$\psi(A_i, A_j \mid I_i, I_j) = \mathbb{1}(A_i \neq A_j) \left( \frac{\xi + e^{-\gamma \lVert f(I_i) - f(I_j) \rVert^2}}{\xi + 1} \right). \qquad (5)$$
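To make the pairwise term concrete, the following sketch evaluates the similarity-weighted Potts potential of Equation 5 for one pair of images. The function name and the default parameter values are hypothetical, chosen only for illustration; inference over the full CRF is not shown.

```python
import math

def pairwise_potential(label_i, label_j, feat_i, feat_j, gamma=1.0, xi=0.1):
    """Similarity-weighted Potts term of Equation 5.

    Zero when the two images share a label; otherwise a weight between
    xi / (xi + 1) and 1 that is largest for images with similar features.
    """
    if label_i == label_j:
        return 0.0
    sq_dist = sum((a - b) ** 2 for a, b in zip(feat_i, feat_j))
    return (xi + math.exp(-gamma * sq_dist)) / (xi + 1.0)
```

The potential is only active when two neighboring images receive different states, and its magnitude is governed by how close the images are in feature space.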


3.4.  Discovering  transformation  ordering 

Each image in our dataset depicts an object, scene, or material in
a particular state. Unfortunately, since images are static, a
single image does not explicitly show a transformation. Instead, we
arrange multiple images in order to identify a transformation. At
present, we only investigate a simple class of transformations:
transitions from “antonym of some state A” to “fully in state A”
(ant(A) to A).

Figure 6 shows our goal. Given a set of images and an adjective A,
we sort images $\{I_i\}$ based on
$g(A \mid I_i) - g(\mathrm{ant}(A) \mid I_i)$ (Eqn. 2).
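As an illustration of this ordering rule, the short sketch below sorts image identifiers by the difference between the two classifier scores. The function name and the dictionary inputs are hypothetical.

```python
def order_by_transformation(image_ids, g_adj, g_ant):
    """Sort images from most ant(A)-like to most A-like.

    g_adj, g_ant: dicts image id -> classifier score for A and for ant(A).
    """
    return sorted(image_ids, key=lambda i: g_adj[i] - g_ant[i])
```

For example, with A = sliced, an image scored 0.2 for sliced and 0.8 for whole is placed before one scored 0.9 and 0.1.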


[Figure 6 panels: Input; Output, ordered from whole to sliced]

Figure 6. Discovering transformation orders: given a particular
adjective A and a collection of images, our goal is to sort the
images according to the transformation from ant(A) to A. In this
figure, we order images from whole to sliced. Note that we do not
use any apple images while training.


4.  Results 

We evaluate three tasks: 1) Identification of relevant
transformations for an image collection. 2) State classification
per image. 3) Ranking images by transformation from ant(A) to A.

Figure 7. Example results on discovering states: A subset of the
image collection for each noun is shown to the left of its
discovered states. Our system does well on foods, scenes, and
clothing (left three collections), but performs more poorly on
objects like computer (bottom-right).

4.1.  Discovering  relevant  transformations 

To implement $g(\cdot)$ (Equation 2), we used logistic regressions
trained on CNN features [8] (Caffe Reference ImageNet Model², layer
fc7 features). Figure 7 shows typical results for retrieval sets of
size $|J| = 5$ (Equation 3). To ensure that the most effective
examples are used, we prioritize negative examples from nouns that
contain the particular adjective we are interested in. This type of
generalization technique has been explored in [1( ] as well.

We evaluated transformation discovery in an image collection as a
retrieval task. We defined the ground truth relevant set of
transformations as those {adjective, antonym} pairs used to scrape
the images (Section 2.1). Our retrieved set was given by Equation
3. We retrieved sets of size $|J| = 1, \ldots, |\mathcal{A}|$. We
quantify our retrieval performance by tracing precision-recall
curves for each noun. mAP over all nouns reaches 0.39 (randomly
ordered retrieval: mAP = 0.11). Although quantitatively there is
room to improve, qualitatively our system is quite successful at
transformation discovery (Figure 7).
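For readers unfamiliar with the metric, the sketch below computes average precision for one noun and the mean over nouns. It uses one common definition of AP (precision averaged at the rank of each relevant item); the exact variant behind the numbers above is not specified here, so treat this as an assumption. The function names and dictionary inputs are hypothetical.

```python
def average_precision(ranked, relevant):
    """AP of a ranked list of transformations against a set of relevant ones."""
    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, ground_truth):
    """rankings: dict noun -> ranked list; ground_truth: dict noun -> relevant set."""
    aps = [average_precision(rankings[n], ground_truth[n]) for n in rankings]
    return sum(aps) / len(aps)
```

Ranking all candidate pairs by their Equation 3 score and sweeping the retrieved set size corresponds to walking down such a ranked list.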

In Figure 9(a), we show performance on several meta-classes of
nouns, such as “metals” (e.g., silver, copper, steel) and “food”
(e.g., salmon, chicken, fish). Our method does well on material and
scene categories but struggles with many object categories. One
possible explanation is that the easier nouns have many synonyms,
or near synonyms, in our dataset. To test this hypothesis, we
measured semantic similarity between all pairs of nouns, using the
service provided by [12]. In Figure 9(b), we plot semantic
similarity versus AP for all nouns. There is indeed a correlation
between synonymy and performance (r = 0.28): the more synonyms a
noun has, the easier our task. This makes sense because it is
easier to generalize to a novel noun when the training set contains
many similar nouns. We investigate our ability to generalize across
dissimilar nouns in Section 4.4.

² http://caffe.berkeleyvision.org


Figure 8. Performance of CRF over various parameters: We show the
results of our CRF method on the collection parsing problem. The
parameters $\lambda$ and $\gamma$ correspond to those in Equations
4 and 5. Note that the accuracy improves as we increase the weight
on the pairwise smoothing term.


4.2. State classification

To implement state classification we optimize our CRF model from
Equation 4 using the method from [2]. This gives us a maximum a
posteriori (MAP) configuration of states per image in the
collection. We evaluated the MAP classifications by measuring mean
accuracy at correctly classifying the state of each image across
all collections. We used the states from our set of discovered
transformations as the label space for the CRF. It is also possible
to run the CRF with all adjectives as candidate classes. However,
using all adjectives does worse: mean accuracy drops from 12.46% to
11.72%. Thus, relevant transformation discovery acts as a holistic
context that improves the classification of each individual image.

In Figure 8 we show how performance varies as a function of the
CRF parameters $\lambda$ and $\gamma$ (Section 3.3). The rightmost
subplot shows the mean accuracy over a grid of settings of
$\lambda$ and $\gamma$. The left two subplots show the accuracy
profile for $\lambda$ with $\gamma$ set to its best value, and vice
versa. We can consider the data term $g$ alone by setting $\lambda$
to zero. This performs worse than when we include the pairwise
potential, demonstrating that parsing the collection holistically
is better than treating each image individually.

Even though state classification per image is quite noisy, because
each image collection contains many images, these noisy classifiers
add up to give fairly accurate characterizations of entire image
collections, as demonstrated by the success of discovering the
relevant transformations in the collection (Section 4.1).

4.3. Ranking images by transformation

We also evaluated how well we perform at ranking images from
ant(A) to A. As ground truth, we use the transformation annotations
provided by human labelers (Section 2.3). We only consider images
that fall in the Mid-A section. Our method achieves $\rho = 0.46$.
We visualize the rankings in Section 4.5. There is ample room for
improvement on this difficult task, which we hope will inspire
future work.

Figure 9. (a) Mean AP at discovering states for different classes
of noun. (b) Performance correlates with the semantic similarity
between the training set and the test noun, but this does not fully
explain why some image collections are easier to understand than
others.

Figure 10. Performance versus percent of training nouns used,
taking nouns in order of semantic similarity to test noun.
Performance increases as more similar nouns are included in the
training set, but is still well above chance even when the training
set only includes dissimilar nouns (chance = gray line, estimated
by assigning random scores to images).

[Figure 11 panel label: Old Computer ↔ New Computer]

Figure 11. Some classes generalize poorly, others generalize
better. In this example we visualize how performance degrades as
semantically similar nouns are removed from the training sets for
computer and room (visualized using the “transformation taxi”
method from Section 4.5). Notice that old↔new computer degrades
rapidly whereas cluttered↔empty may be more easily generalizable
across dissimilar noun classes.

4.4.  How  well  does  our  system  generalize  across 
dissimilar  noun  classes? 

Our tasks are all about transferring knowledge from a training set
of noun classes to novel nouns. Sometimes this task is very easy:
transferring from laptop to computer might not require much
generalization. To test how well our method generalizes, we
restricted our training set, for each query noun, to only include
nouns that are at least a certain semantic distance from the query
noun. As in Figure 9(b), we again use semantic similarity scores
obtained from [12]. In Figure 10, we plot how performance
increases, on each of our tasks, as the training set grows to
include more and more nouns that are semantically similar to the
query noun. Clearly, including synonyms helps, but performance is
still well above chance even when the training set only contains
nouns quite distinct from the query noun: our system can generalize
state transformations over fairly dissimilar noun classes.
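The restriction of the training set can be sketched as follows. This is an illustrative reading of the protocol, assuming nouns are added from least to most similar to the query noun; the function name and the similarity table are hypothetical.

```python
def training_nouns(query, nouns, similarity, fraction):
    """Keep the `fraction` of nouns that are least similar to `query`.

    similarity: dict (query, noun) -> semantic similarity score
    Growing `fraction` toward 1.0 adds progressively more similar nouns.
    """
    others = [n for n in nouns if n != query]
    others.sort(key=lambda n: similarity[(query, n)])
    keep = int(fraction * len(others))
    return others[:keep]
```

With a small fraction, a query such as computer is trained only on distant nouns; near-synonyms such as laptop enter the training set last.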

We visualize the effect of removing semantically similar nouns in
Figure 11. Here we show ranked transformations for old↔new computer
and cluttered↔empty room, using the visualization method described
in Section 4.5. As we restrict the training set to include fewer
and fewer similar nouns, the qualitative results degrade, as did
the quantitative results. However, some classes generalize better
than others. For example, old↔new computer may rely on having very
similar examples in the training set (in particular, old↔new
laptop) in order to perform well; removing laptop from the training
set has a big effect. On the other hand, many classes can undergo
the transformation cluttered↔empty, and this transformation tends
to look alike between classes: a busy textural scene becomes flat
and homogeneous in appearance. Correspondingly, cluttered↔empty
room is less rapidly affected by removing similar nouns from the
training set.

4.5.  Visualizing  transformations 

A  transformation  can  be  visualized  by  finding  a  smooth 
sequence  of  images  from  a  starting  state  (A)  to  a  trans¬ 
formed  ending  state  (ant (A)).  We  use  a  method  similar 
to  “image  taxis”  [14]. 

First, we convert the input image collection to a graph. Each
image is connected to its k-nearest neighbors in feature space
(k = 5 in our implementation). For adjective A, we find a path
$\mathcal{P}$ through the graph that optimizes the following cost
function:

$$\operatorname*{argmin}_{\mathcal{P}} \; g(A \mid I_s) + g(\mathrm{ant}(A) \mid I_t) + \frac{1}{|\mathcal{P}|} \sum_{i=1}^{|\mathcal{P}|} \lVert \mathbf{f}_i - \mathbf{f}_{i+1} \rVert_1, \qquad (6)$$

where $g(\cdot)$ is given by Equation 2, $s$ is the starting node
of $\mathcal{P}$, and $t$ is the ending node. For features we use
our adjective classifier scores:
$\mathbf{f}_i = \{f_j(I_i)\}_{j=1}^{|\mathcal{A}|}$ and
$f_j(I_i) = g(A_j \mid I_i)$. In addition, we multiply the feature
channel for A and ant(A) by a constant (20 in our implementation)
in order to encourage smoothness most of all in that channel. This
cost function says: 1) the starting image should be A, 2) the
ending image should be highly ant(A), and 3) the path should be
smooth in feature space. This cost can be optimized efficiently
using Dijkstra’s algorithm. As an additional constraint we only
consider values of $s$ among the top 5 images according to
$g(A \mid I_s)$, and $t$ among the top 5 images according to
$g(\mathrm{ant}(A) \mid I_t)$.
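The graph construction and path search can be sketched as below. This is a simplified illustration, not the exact procedure: it assumes the start and end images have already been chosen from the top-ranked candidates, treats the neighbor graph as undirected, and drops the path-length normalization of Equation 6 so that standard Dijkstra over summed L1 edge costs applies. All function names are hypothetical.

```python
import heapq

def l1(u, v):
    """L1 distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def knn_graph(features, k=5):
    """Undirected k-nearest-neighbor graph; edge weight = L1 feature distance."""
    n = len(features)
    graph = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: l1(features[i], features[j]))[:k]
        for j in nearest:
            d = l1(features[i], features[j])
            graph[i][j] = d
            graph[j][i] = d
    return graph

def smoothest_path(graph, start, end):
    """Dijkstra's algorithm: path from start to end with minimum summed edge weight."""
    dist, prev = {start: 0.0}, {}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == end:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if end not in dist:
        return None  # start and end lie in different components
    path = [end]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]
```

One way to use this is to run the search for each candidate (s, t) pair and keep the path whose full cost, including the endpoint terms, is lowest.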

Example transformation visualizations are shown in Figure 12. Here
we show several transformations each for several noun classes. For
simple transformations, including color changes, such as
dark↔bright and cluttered↔empty, the method is reasonably
effective. For geometric transformations, such as
deflated↔inflated, the results are much worse. Future work should
focus on capturing these difficult types of transformations.

5.  Conclusion 

In  this  paper,  we  have  introduced  the  novel  problem  of 
discovering  and  characterizing  states  and  transformations  in 
image  collections.  We  have  shown  that  simple  yet  powerful 
techniques  are  sufficient  to  make  progress  on  this  problem. 
We  will  publicly  release  our  dataset  and  code  to  promote 
further  work  on  this  difficult  problem. 

Figure 12. Example transformation visualizations: each transformation was discovered from the image collection of a noun class that
was not included in the algorithm’s training.